Microbiome Analysis

This code is written to practice the workflow for processing and analyzing data from amplicon sequencing experiment. The data used here comes from V4 region of 16s genes. 360 fecal samples were collected from 12 mice over a period of one year to study murine microbiome. The data can be found here.

library(knitr)
library(microbiome)
library(dada2)
library(DECIPHER)
library(phyloseq)
library(phangorn)
library(gridExtra)
library(ggplot2)
library(phyloseqGraphTest)
library(genefilter)
library(impute)
library(dplyr)
library(reshape2)
library(ade4)
library(nlme)
library(DESeq2)
library(PMA)
library("phyloseqGraphTest")
library("ggnetwork")
library(randomForest)
library(structSSI)
library(igraph)

Filter and Trim To begin with, we filter out the low quality reads. So we plot the data quality first, to see where to filter out in this dataset. Here we plot data from first two samples of forward and reverse reads.

# Sorting files with forward and reverse reads
fnFs <- sort(list.files(miseq_path, pattern = "_R1_001.fastq")) # Forward reads
fnRs <- sort(list.files(miseq_path, pattern = "_R2_001.fastq")) # Reverse reads

# Extract sample names:
sampleNames <- sapply(strsplit(fnFs, "_"), `[`, 1)

# specifying full paths to the file lists:
fnFs <- file.path(miseq_path, fnFs)
fnRs <- file.path(miseq_path, fnRs)

# quality of first 2 reads. # Most data shows trend of decreasing average quality towards end of the sequencing reads
dada2::plotQualityProfile(fnFs[1:2])

dada2::plotQualityProfile(fnRs[1:2])

Here, the forward reads maintain high quality throughout, while the quality of the reverse reads drops significantly at about position 160. Therefore, we choose to truncate the forward reads at position 245, and the reverse reads at position 160. We also choose to trim the first 10 nucleotides of each read based on empirical observations across many Illumina datasets that these base positions are particularly likely to contain pathological errors.

We define the filenames for the filtered fastq.gz files:

filt_path <- file.path(miseq_path, "filtered") # To keep filtered files in filtered directories
if(!file_test("-d", filt_path)) dir.create(filt_path)
filtFs <- file.path(filt_path, paste0(sampleNames, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sampleNames, "_R_filt.fastq.gz"))
out <- dada2::filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen = c(240, 160),
                            maxN = 0, maxEE = c(2,2), truncQ = 2, rm.phix = TRUE,
                            compress = T, multithread = T)
# head(out)

Infer Sequence variants

The next step in the workflow is to cluster the sequences based on a arbitrary fixed dissimilarity threshold, however, with DADA2 package, we can get amplicon sequencing variants exactly, with the resolution of one nucleotide (Benjamin J Callahan et al. 2016).

The sequence data is imported into R from demultiplexed fastq files (i.e. one fastq for each sample) and simultaneously dereplicated to remove redundancy. This done to improve efficiency of computational time.

Dereplication

Dereplication combines all identical sequencing reads into “unique sequences” with a corresponding “abundance”: the number of reads with that unique sequence. This step is essentially converting the data with sequences into counts that can be used for further analysis.

derepFs <- dada2::derepFastq(filtFs, verbose = T)
derepRs <- dada2::derepFastq(filtRs, verbose = T)

# Name the derep class objects by the sample names
names(derepFs) <- sampleNames
names(derepRs) <- sampleNames

Next, we need to distinguish sequencing errors from biological variation and and DADA2 relies on unsupervised learning from the data where sample interference and parameter estimation are alternated with until consistency is seen.

errF <- dada2::learnErrors(filtFs, multithread = T)
## 33514080 total bases in 139642 reads from 20 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread = T)
## 22342720 total bases in 139642 reads from 20 samples will be used for learning the error rates.
plotErrors(errF)

plotErrors(errR)

Here, since we see fewer errors that are outliers with respect to the fitted error rates, we can resonably sure of the estimation by the parameterized model.

dadaFs <- dada(derepFs, err=errF, multithread = T, USE_QUALS=TRUE)
## Sample 1 - 7113 reads in 1979 unique sequences.
## Sample 2 - 5299 reads in 1639 unique sequences.
## Sample 3 - 5463 reads in 1477 unique sequences.
## Sample 4 - 2914 reads in 904 unique sequences.
## Sample 5 - 2941 reads in 939 unique sequences.
## Sample 6 - 4312 reads in 1267 unique sequences.
## Sample 7 - 6741 reads in 1756 unique sequences.
## Sample 8 - 4560 reads in 1438 unique sequences.
## Sample 9 - 15637 reads in 3590 unique sequences.
## Sample 10 - 11413 reads in 2762 unique sequences.
## Sample 11 - 12017 reads in 3021 unique sequences.
## Sample 12 - 5032 reads in 1566 unique sequences.
## Sample 13 - 18075 reads in 3707 unique sequences.
## Sample 14 - 6250 reads in 1479 unique sequences.
## Sample 15 - 4052 reads in 1195 unique sequences.
## Sample 16 - 7369 reads in 1832 unique sequences.
## Sample 17 - 4765 reads in 1183 unique sequences.
## Sample 18 - 4871 reads in 1382 unique sequences.
## Sample 19 - 6504 reads in 1709 unique sequences.
## Sample 20 - 4314 reads in 897 unique sequences.
dadaRs <- dada(derepRs, err=errR, multithread = T, USE_QUALS=TRUE)
## Sample 1 - 7113 reads in 1660 unique sequences.
## Sample 2 - 5299 reads in 1349 unique sequences.
## Sample 3 - 5463 reads in 1335 unique sequences.
## Sample 4 - 2914 reads in 853 unique sequences.
## Sample 5 - 2941 reads in 880 unique sequences.
## Sample 6 - 4312 reads in 1286 unique sequences.
## Sample 7 - 6741 reads in 1803 unique sequences.
## Sample 8 - 4560 reads in 1265 unique sequences.
## Sample 9 - 15637 reads in 3414 unique sequences.
## Sample 10 - 11413 reads in 2522 unique sequences.
## Sample 11 - 12017 reads in 2771 unique sequences.
## Sample 12 - 5032 reads in 1415 unique sequences.
## Sample 13 - 18075 reads in 3290 unique sequences.
## Sample 14 - 6250 reads in 1390 unique sequences.
## Sample 15 - 4052 reads in 1134 unique sequences.
## Sample 16 - 7369 reads in 1635 unique sequences.
## Sample 17 - 4765 reads in 1084 unique sequences.
## Sample 18 - 4871 reads in 1161 unique sequences.
## Sample 19 - 6504 reads in 1502 unique sequences.
## Sample 20 - 4314 reads in 732 unique sequences.

Inspecting the dada class object returned by dada

dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 128 sequence variants were inferred from 1979 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

The DADA2 algorithm inferred 128 real sequence variants from the 1979 unique sequences in the first sample. The dada-class object contains multiple diagnostics about the quality of each inferred sequence variant(see help(“dada-class”) for some info).

Now , that the DADA2 sequence inference step removed most substitution and indel errors, (Benjamin J Callahan et al. 2016), we can merge the inferred forward and reverse squences and remove paired sequences that do not perfectly overlap as a last step in quality control of errors.

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)

Construct sequence table and remove chimeras:

he DADA2 method produces a sequence table that is a higher-resolution analogue of the common “OTU table” (operational taxonomic units), i.e. a sample by sequence feature table valued by the number of times each sequence was observed in each sample.

seqtabAll <- makeSequenceTable(mergers[!grepl("Mock", names(mergers))])
table(nchar(getSequences(seqtabAll)))
## 
## 251 252 253 254 255 
##   1  85 186   5   2
# Save and retrieve the seq table
# saveRDS(seqtabAll, file = "microbiome.rds")
# seqtabAll <- readRDS("microbiome.rds")

Since the error model that we used above does not account for chimeric components, we remove them by comparing every inferred sequence to others in the table, and removing the ones that have partial compliments to others sequences.

seqtabNoC <- removeBimeraDenovo(seqtabAll)

Based on the numbers of sequences before and after removal of the chimeric sequences, they make up for 22% of the infered sequence variants. They however make up for no more than 4% of total sequence reads. # Assign Taxonomy: By sequencing 16srRNA genes, we can taxonomically classify sequence variants. By using the naive Bayesian classifier method (Wang et al. 2007), sequence variants are compared to a set of classified sequences for training. Here we use RDP v16 training set (Cole et al. 2009). For fungal taxonomy, the General Fasta release files from the UNITE ITS database can be used without any modification.

fastaRef <- "../data/MiSeq_SOP/rdp_train_set_16.fa.gz"
taxTab <- assignTaxonomy(seqtabNoC, refFasta = fastaRef, multithread = T)
unname(head(taxTab))
##      [,1]       [,2]            [,3]          [,4]           
## [1,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
## [2,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
## [3,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
## [4,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
## [5,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
## [6,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
##      [,5]                 [,6]         
## [1,] "Porphyromonadaceae" NA           
## [2,] "Porphyromonadaceae" NA           
## [3,] "Porphyromonadaceae" NA           
## [4,] "Porphyromonadaceae" "Barnesiella"
## [5,] "Bacteroidaceae"     "Bacteroides"
## [6,] "Porphyromonadaceae" "Barnesiella"

Construct a Phylogenetic tree:

Using the DECIPHER package, we use the sequences found after removing the chimeras to perform multiple alignment for constructing a phylogenetic tree. The sequence inference method we used does not use a reference, so we construct the phylogenetic tree de novo. First we use DECIPHER for multiple alignment and then use phangorn to contruct a phylogenetic tree. We construct the neighbourhood joining tree first and then fit a GTR+G+I (Generalized time reviersible with Gamma rate variation) maximum likelihood tree using the neighbour-joining tree as the initiating point.

seqs <- getSequences(seqtabNoC)
names(seqs) <- seqs
alignment <- DECIPHER::AlignSeqs(DNAStringSet(seqs), anchor=NA, verbose = F)
phangAlign <- phyDat(as(alignment, "matrix"), type = "DNA")
dm <- dist.ml(phangAlign)
treeNJ <- NJ(dm) 
fit = pml(treeNJ, data = phangAlign)
fitGTR <- update(fit, k=4, inv=0.2)
fitGTR <- optim.pml(fitGTR, model = "GTR", optInv = T, optGamma = T,
                    rearrangement = "stochastic", control = pml.control(trace = 0))
detach("package:phangorn", unload = T)

Combine the data into Phyloseq object:

The package phyloseq organizes and synthesizes the different data types from a typical amplicon sequencing experiment into a single data object that can be easily manipulated. The sample data is downloaded from github:

samdf <- read.csv("https://raw.githubusercontent.com/spholmes/F1000_workflow/master/data/MIMARKS_Data_combined.csv",header=TRUE)
samdf$SampleID <- paste0(gsub("00", "", samdf$host_subject_id), "D", samdf$age-21)
samdf <- samdf[!duplicated(samdf$SampleID),] # Remove dupicate entries for reverse reads
rownames(seqtabAll) <- gsub("124", "125", rownames(seqtabAll)) # Fix discrepancy # Can be commented for other data
all(rownames(seqtabAll) %in% samdf$SampleID) # Should be TRUE
## [1] TRUE
rownames(samdf) <- samdf$SampleID
keep.cols <- c("collection_date", "biome", "target_gene", "target_subfragment",
"host_common_name", "host_subject_id", "age", "sex", "body_product", "tot_mass",
"diet", "family_relationship", "genotype", "SampleID") 
samdf <- samdf[rownames(seqtabAll), keep.cols]

All the data needed for this analysis including the sequence feature table, metadata, taxonomy information and phylogenetic tree are combines into an object.

ps <- phyloseq(otu_table(seqtabNoC, taxa_are_rows = F),
               sample_data(samdf),
               tax_table(taxTab), phy_tree(fitGTR$tree))
ps <- prune_samples(sample_names(ps) != "Mock", ps) # Remove the "Mock" sample
ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 218 taxa and 19 samples ]
## sample_data() Sample Data:       [ 19 samples by 14 sample variables ]
## tax_table()   Taxonomy Table:    [ 218 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 218 tips and 216 internal nodes ]

Using phyloseq:

phyloseq is an R package which uses S4 data class to analyse amplicon-seq data. Additional information about the package can be found at phyloseq homepage. The dada sequence processing was run on the large dataset that is stored on Github and is downloaded as follows.

ps_connect <- url("https://raw.githubusercontent.com/spholmes/F1000_workflow/master/data/ps.rds")
ps = readRDS(ps_connect)
ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 389 taxa and 360 samples ]
## sample_data() Sample Data:       [ 360 samples by 14 sample variables ]
## tax_table()   Taxonomy Table:    [ 389 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 389 tips and 387 internal nodes ]

Filtering

phyloseq has tools for filtering, subsetting and grouping data. While this can be used to group the data according to taxa, it can also be used to remove noise.

Taxonomic filtering

For many experiments, all organisms are well represented in available taxonomic reference databases. In these conditions, the ambiguous features that cannot be assigned to any high rank taxonomy are usually sequence artifacts that do not exist in nature, and are filtered out. However, for poorly charachterized specimens or novel samples, it is important to rule out the possibility of taxonomic novelty before filtering these sequences. This can be done at various taxon leves as mentioned below.

# Show available ranks in dataset
rank_names(ps)
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"

Here we consider rank phylum and create a table of read counts for each phylum present.

# create a table with number of features for each phyla
table(tax_table(ps)[, "Phylum"], exclude=NULL)
## 
##              Actinobacteria               Bacteroidetes 
##                          13                          23 
## Candidatus_Saccharibacteria   Cyanobacteria/Chloroplast 
##                           1                           4 
##         Deinococcus-Thermus                  Firmicutes 
##                           1                         327 
##                Fusobacteria              Proteobacteria 
##                           1                          11 
##                 Tenericutes             Verrucomicrobia 
##                           1                           1 
##                        <NA> 
##                           6

The phyla for which only one feature is observed can be filtered out. The features that are annotated with NA can also be filtered out, as they may be artifacts. We also filter out features with ambiguous annotations.

ps <- subset_taxa(ps, !is.na(Phylum) & !Phylum %in% c("", "uncharachterized"))

We should also explore the prevelance in the dataset where we see the number of samples in which a phylum is seen atleast once. We calculate the mean and total prevalence of each phylum.

# Compute prevelance of each feature
prevdf = apply(X = otu_table(ps),
               MARGIN = ifelse(taxa_are_rows(ps), yes = 1, no = 2),
               FUN = function(x) {sum(x>0)})

# Add taxonomy and total read counts to this data.frame
prevdf = data.frame(Prevalence = prevdf,
                    TotalAbundance = taxa_sums(ps),
                    tax_table(ps))

plyr::ddply(prevdf, "Phylum", function(df1){cbind(mean(df1$Prevalence),sum(df1$Prevalence))})
##                         Phylum         1     2
## 1               Actinobacteria 120.15385  1562
## 2                Bacteroidetes 265.52174  6107
## 3  Candidatus_Saccharibacteria 280.00000   280
## 4    Cyanobacteria/Chloroplast  64.25000   257
## 5          Deinococcus-Thermus  52.00000    52
## 6                   Firmicutes 179.24771 58614
## 7                 Fusobacteria   2.00000     2
## 8               Proteobacteria  59.09091   650
## 9                  Tenericutes 234.00000   234
## 10             Verrucomicrobia 104.00000   104

With this table, we can decide which taxa we can filter out. Here Fusobacteria have appeared in only 2 samples and Deinococcus thermus is seen in only 1% of all samples. So we can filter these out. In some cases, it might be useful to not filter them out, but here we will.

# Define phyla to filter
filterPhyla = c("Fusobacteria", "Deinococcus-Thermus")
# Filter entries with unidentified Phylum.
ps1 = subset_taxa(ps, !Phylum %in% filterPhyla)
ps1
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 381 taxa and 360 samples ]
## sample_data() Sample Data:       [ 360 samples by 14 sample variables ]
## tax_table()   Taxonomy Table:    [ 381 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 381 tips and 379 internal nodes ]

We now visualize prevalence in each phyla for exploratory analysis.

# Subset to the remaining phyla
prevdf1 = subset(prevdf, Phylum %in% get_taxa_unique(ps1, "Phylum"))
ggplot(prevdf1, aes(TotalAbundance, Prevalence/nsamples(ps), color=Phylum)) +
  # Include a guess for parameter
  geom_hline(yintercept = 0.05, alpha = 0.5, linetype = 2) +
  geom_point(size = 2, alpha = 0.7) +
  scale_x_log10() +
  xlab("Total Abundance") +
  ylab("Prevalence [Frac. Samples]") +
  facet_wrap(~Phylum) +
  theme(legend.position="none")

Note that each point in the figure above represents a different taxa. So we see maximum representation from Firmicutes in this data that we can explore more.

# Define prevalence threshold as 5% of total samples
prevalenceThreshold = 0.05 * nsamples(ps)
prevalenceThreshold
## [1] 18
# Execute prevalence filter, using `prune_taxa()` function
keepTaxa = rownames(prevdf1)[(prevdf1$Prevalence >= prevalenceThreshold)]
ps2 = prune_taxa(keepTaxa, ps)

https://bioconductor.org/help/course-materials/2017/BioC2017/Day1/Workshops/Microbiome/MicrobiomeWorkflowII.html#using_phyloseq ## Agglomerate Taxa Now we agglomerate the taxa in closely related genuses. If the taxonomic agglomeration is defined ahead of time, we can get criteria for grouping the microbial features. Not always accurate or useful, we can still get tree-like graph consisting of nodes called “ranks”. However, if the taxonomy is not available, clustering can be used where the tree height is specified by users with respect to phylogenetic distance between the features to define the grouping. It is different from OTU clustering because unlike OTU clustering, it uses evolutionary information for sequence distance to cluster.

# How many genera would be present after filtering?
length(get_taxa_unique(ps2, taxonomic.rank = "Phylum"))
## [1] 8
ps3 = tax_glom(ps2, "Genus", NArm = TRUE)
h1 = 0.4
ps4 = tip_glom(ps2, h = h1)
multiPlotTitleTextSize = 15
p2tree = plot_tree(ps2, method = "treeonly",
                   ladderize = "left",
                   title = "Before Agglomeration") +
  theme(plot.title = element_text(size = multiPlotTitleTextSize))
p3tree = plot_tree(ps3, method = "treeonly",
                   ladderize = "left", title = "By Genus") +
  theme(plot.title = element_text(size = multiPlotTitleTextSize))
p4tree = plot_tree(ps4, method = "treeonly",
                   ladderize = "left", title = "By Height") +
  theme(plot.title = element_text(size = multiPlotTitleTextSize))

# group plots together
grid.arrange(nrow = 1, p2tree, p3tree, p4tree)

Abundance Value Transformation

We can explore relative abundance in each taxon with following snippet.

plot_abundance = function(physeq,title = "",
                          Facet = "Order", Color = "Phylum"){
  # Arbitrary subset, based on Phylum, for plotting
  p1f = subset_taxa(physeq, Phylum %in% c("Firmicutes"))
  mphyseq = psmelt(p1f)
  mphyseq <- subset(mphyseq, Abundance > 0)
  ggplot(data = mphyseq, mapping = aes_string(x = "sex",y = "Abundance",
                              color = Color, fill = Color)) +
    geom_violin(fill = NA) +
    geom_point(size = 1, alpha = 0.3,
               position = position_jitter(width = 0.3)) +
    facet_wrap(facets = Facet) + scale_y_log10()+
    theme(legend.position="none")
}
# Transform to relative abundance. Save as new object.
ps3ra = transform_sample_counts(ps3, function(x){x / sum(x)})

plotBefore = plot_abundance(ps3,"")
plotAfter = plot_abundance(ps3ra,"")
# Combine each plot into one graphic.
grid.arrange(nrow = 2,  plotBefore, plotAfter)

Subset by taxonomy

Notice on the previous plot that Lactobacillales appears to be a taxonomic Order with bimodal abundance profile in the data. We can check for a taxonomic explanation of this pattern by plotting just that taxonomic subset of the data. For this, we subset with the function, and then specify a more precise taxonomic rank to the argument of the function that we defined above.

psOrd = subset_taxa(ps3ra, Order == "Lactobacillales")
plot_abundance(psOrd, Facet = "Genus", Color = "Genus")

psOrd = subset_taxa(ps3ra, Order == "Clostridiales")
plot_abundance(psOrd, Facet = "Family", Color = "Genus")

Preprocessing

We want to test if the age of mice has any effect on clustering. S

qplot(sample_data(ps)$age, geom = "histogram",binwidth=20) + xlab("age")

Here we see, that the age co-variate belongs to three seperate clusters.

qplot(log10(rowSums(otu_table(ps))),binwidth=0.2) +
  xlab("Logged counts-per-sample")

These graphs suggest that a log(1+x) transformations might be sufficient for normalizing the abundance data for exploratory analysis. We look at the principal coordinates analysis (PCoA) with the Bray-Curtis dissimilarity on the weighted Unifrac distance. H

sample_data(ps)$age_binned <- cut(sample_data(ps)$age,
                          breaks = c(0, 100, 200, 400))
levels(sample_data(ps)$age_binned) <- list(Young100="(0,100]", Mid100to200="(100,200]", Old200="(200,400]")
sample_data(ps)$family_relationship=gsub(" ","",sample_data(ps)$family_relationship)
pslog <- transform_sample_counts(ps, function(x) log(1 + x))
out.wuf.log <- ordinate(pslog, method = "MDS", distance = "wunifrac")
evals <- out.wuf.log$values$Eigenvalues
plot_ordination(pslog, out.wuf.log, color = "age_binned") +
  labs(col = "Binned Age") +
  coord_fixed(sqrt(evals[2] / evals[1]))

Here we take some of the outliers out, as we are only concerned with the relationships between the non-outlier points.

rel_abund <- t(apply(otu_table(ps), 1, function(x) x / sum(x)))
qplot(rel_abund[, 12], geom = "histogram",binwidth=0.05) +
  xlab("Relative abundance")

Different Orientation projections

Here we do unsupervized exploratory analysis.

Here we remove the outliers, and samples with less than 1000 reads. Then we run a PCoA using Bray-Curtis similarity.

outliers <- c("F5D165", "F6D165", "M3D175", "M4D175", "M5D175", "M6D175")
ps <- prune_samples(!(sample_names(ps) %in% outliers), ps)
which(!rowSums(otu_table(ps)) > 1000)
## F5D145 M1D149   M1D9 M2D125  M2D19 M3D148 M3D149   M3D3   M3D5   M3D8 
##     69    185    200    204    218    243    244    252    256    260
ps <- prune_samples(rowSums(otu_table(ps)) > 1000, ps)
pslog <- transform_sample_counts(ps, function(x) log(1 + x))
out.pcoa.log <- ordinate(pslog,  method = "MDS", distance = "bray")
evals <- out.pcoa.log$values[,1]
plot_ordination(pslog, out.pcoa.log, color = "age_binned",
                  shape = "family_relationship") +
  labs(col = "Binned Age", shape = "Litter")+
  coord_fixed(sqrt(evals[2] / evals[1]))

Here we see that the effect of age is fairly consistent between the all the mice, irrespective of the gender and litter.

We also look at double principal coordinates analysis (DPCoA) which gives a biplot representation of both samples and taxonomic categories.

out.dpcoa.log <- ordinate(pslog, method = "DPCoA")
evals <- out.dpcoa.log$eig
plot_ordination(pslog, out.dpcoa.log, color = "age_binned", label= "SampleID",
                  shape = "family_relationship") +
  labs(col = "Binned Age", shape = "Litter")+
  coord_fixed(sqrt(evals[2] / evals[1]))

plot_ordination(pslog, out.dpcoa.log, type = "species", color = "Phylum") +
  coord_fixed(sqrt(evals[2] / evals[1]))

Here we see the results of PCoA with weighted unifrac, where we see second axis is associated with age effect, however, this does not take abundance into account.

out.wuf.log <- ordinate(pslog, method = "PCoA", distance ="wunifrac")
evals <- out.wuf.log$values$Eigenvalues
plot_ordination(pslog, out.wuf.log, color = "age_binned",
                  shape = "family_relationship") +
  coord_fixed(sqrt(evals[2] / evals[1])) +
  labs(col = "Binned Age", shape = "Litter")

PCA on Ranks

Because microbial abundance data is tail-heavy, it is easier to ignore abundances all together, and work with ranks instead.

abund <- otu_table(pslog)
abund_ranks <- t(apply(abund, 1, rank))
abund_ranks <- abund_ranks - 329
abund_ranks[abund_ranks < 1] <- 1
abund_df <- melt(abund, value.name = "abund") %>%
  left_join(melt(abund_ranks, value.name = "rank"))
colnames(abund_df) <- c("sample", "seq", "abund", "rank")

abund_df <- melt(abund, value.name = "abund") %>%
  left_join(melt(abund_ranks, value.name = "rank"))
colnames(abund_df) <- c("sample", "seq", "abund", "rank")

sample_ix <- sample(1:nrow(abund_df), 8)
ggplot(abund_df %>%
         filter(sample %in% abund_df$sample[sample_ix])) +
  geom_point(aes(x = abund, y = rank, col = sample),
             position = position_jitter(width = 0.2), size = 1.5) +
  labs(x = "Abundance", y = "Thresholded rank") +
  scale_color_brewer(palette = "Set2")

This is rank threshold transformation.

ranks_pca <- dudi.pca(abund_ranks, scannf = F, nf = 3)
row_scores <- data.frame(li = ranks_pca$li,
                         SampleID = rownames(abund_ranks))
col_scores <- data.frame(co = ranks_pca$co,
                         seq = colnames(abund_ranks))
tax <- tax_table(ps) %>%
  data.frame(stringsAsFactors = FALSE)
tax$seq <- rownames(tax)
main_orders <- c("Clostridiales", "Bacteroidales", "Lactobacillales",
                 "Coriobacteriales")
tax$Order[!(tax$Order %in% main_orders)] <- "Other"
tax$Order <- factor(tax$Order, levels = c(main_orders, "Other"))
tax$otu_id <- seq_len(ncol(otu_table(ps)))
row_scores <- row_scores %>%
  left_join(sample_data(pslog))
col_scores <- col_scores %>%
  left_join(tax)
evals_prop <- 100 * (ranks_pca$eig / sum(ranks_pca$eig))
ggplot() +
  geom_point(data = row_scores, aes(x = li.Axis1, y = li.Axis2), shape = 2) +
  geom_point(data = col_scores, aes(x = 25 * co.Comp1, y = 25 * co.Comp2, color =
                                    Order), size = .3, alpha = 0.6) +
  scale_color_brewer(palette = "Set2") +
  
  facet_grid(~ age_binned) +
  guides(col = guide_legend(override.aes = list(size = 3))) +
  labs(x = sprintf("Axis1 [%s%% variance]", round(evals_prop[1], 2)),
  y = sprintf("Axis2 [%s%% variance]", round(evals_prop[2], 2))) +
  coord_fixed(sqrt(ranks_pca$eig[2] / ranks_pca$eig[1])) +
  # theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0)))
  theme_minimal()

We performed the PCA and the result is similar to PCoA without applying truncated ranking transformation, proving the infalliability of our analysis of the original data.

Cannonical correspondance:

Canonical Correspondence Analysis (CCpnA) is a method used for ordinating species by sample tables while integrating additional information about the samples. Just as in previous methods, the aim of generating biplots is to discern the predominant types of bacterial communities across various mouse sample types.

ps_ccpna <- ordinate(pslog, "CCA", formula = pslog ~ age_binned + family_relationship)
library(ggrepel)
ps_scores <- vegan::scores(ps_ccpna)
sites <- data.frame(ps_scores$sites)
sites$SampleID <- rownames(sites)
sites <- sites %>%
  left_join(sample_data(ps))

species <- data.frame(ps_scores$species)
species$otu_id <- seq_along(colnames(otu_table(ps)))
species <- species %>%
  left_join(tax)
evals_prop <- 100 * ps_ccpna$CCA$eig[1:2] / sum(ps_ccpna$CA$eig)
ggplot() +
  geom_point(data = sites, aes(x = CCA1, y = CCA2), shape = 2, alpha = 0.5) +
  geom_point(data = species, aes(x = CCA1, y = CCA2, col = Order), size = 0.5) +
  geom_text_repel(data = species %>% filter(CCA2 < -2),
                    aes(x = CCA1, y = CCA2, label = otu_id),
            size = 1.5, segment.size = 0.1) +
  facet_grid(. ~ family_relationship) +
  guides(col = guide_legend(override.aes = list(size = 3))) +
  labs(x = sprintf("Axis1 [%s%% variance]", round(evals_prop[1], 2)),
       y = sprintf("Axis2 [%s%% variance]", round(evals_prop[2], 2))) +
  scale_color_brewer(palette = "Set2") +
  coord_fixed(sqrt(ps_ccpna$CCA$eig[2] / ps_ccpna$CCA$eig[1])*0.45) +
  # theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0)))
  theme_minimal()

Supervised learning

We attempt to apply supervised learning to try and predict age from the microbiome composition.

library(caret)
sample_data(pslog)$age2 <- cut(sample_data(pslog)$age, c(0, 100, 400))
dataMatrix <- data.frame(age = sample_data(pslog)$age2, otu_table(pslog))
# take 8 mice at random to be the training set, and the remaining 4 the test set
trainingMice <- sample(unique(sample_data(pslog)$host_subject_id), size = 8)
inTrain <- which(sample_data(pslog)$host_subject_id %in% trainingMice)
training <- dataMatrix[inTrain,]
testing <- dataMatrix[-inTrain,]
plsFit <- train(age ~ ., data = training,
                method = "pls", preProc = "center")
plsClasses <- predict(plsFit, newdata = testing)
table(plsClasses, testing$age)
##            
## plsClasses  (0,100] (100,400]
##   (0,100]        66         0
##   (100,400]       2        45

Random forests can be tried too.

library(randomForest)
rfFit <- train(age ~ ., data = training, method = "rf",
               preProc = "center", proximity = TRUE)
rfClasses <- predict(rfFit, newdata = testing)
table(rfClasses, testing$age)
##            
## rfClasses   (0,100] (100,400]
##   (0,100]        66         1
##   (100,400]       2        44
pls_biplot <- list("loadings" = loadings(plsFit$finalModel),
                   "scores" = plsFit[["finalModel"]][["scores"]])
class(pls_biplot$scores) <- "matrix"

pls_biplot$scores <- data.frame(sample_data(pslog)[inTrain, ],
                                pls_biplot$scores)

tax <- tax_table(ps)@.Data %>%
  data.frame(stringsAsFactors = FALSE)
main_orders <- c("Clostridiales", "Bacteroidales", "Lactobacillales",
                 "Coriobacteriales")
tax$Order[!(tax$Order %in% main_orders)] <- "Other"
tax$Order <- factor(tax$Order, levels = c(main_orders, "Other"))
class(pls_biplot$loadings) <- "matrix"
pls_biplot$loadings <- data.frame(tax, pls_biplot$loadings)
ggplot() +
  geom_point(data = pls_biplot$scores,
             aes(x = Comp.1, y = Comp.2), shape = 2) +
  geom_point(data = pls_biplot$loadings,
             aes(x = 25 * Comp.1, y = 25 * Comp.2, col = Order),
             size = 0.3, alpha = 0.6) +
  scale_color_brewer(palette = "Set2") +
  labs(x = "Axis1", y = "Axis2", col = "Binned Age") +
  guides(col = guide_legend(override.aes = list(size = 3))) +
  facet_grid( ~ age2) +
  # theme(panel.border = element_rect(color = "#787878", fill = alpha("white", 0)))
  theme_minimal()

Random forests classify with bootstrap procedure. The distance is measured by how ofte same sample occurs in the same partition. The resulting distance is then inserted into PCoA.

rf_prox <- cmdscale(1 - rfFit$finalModel$proximity) %>%
  data.frame(sample_data(pslog)[inTrain, ])

ggplot(rf_prox) +
  geom_point(aes(x = X1, y = X2, col = age_binned),
             size = 1, alpha = 0.7) +
  scale_color_manual(values = c("#A66EB8", "#238DB5", "#748B4F")) +
  guides(col = guide_legend(override.aes = list(size = 4))) +
  labs(col = "Binned Age", x = "Axis1", y = "Axis2")

Here the class seperation is fairly distinct. Manually selecting the points can give more information about the samples that were difficult to classify. For example the old aged samples have not shown any clear clustering together. There are also very few samples of older than 200 days.

as.vector(tax_table(ps)[which.max(importance(rfFit$finalModel)), c("Family", "Genus")])
## [1] "Lachnospiraceae" "Roseburia"

To identify which species has most influence on the random forest model, we ran the last snippet and found that Roseburia from family Lachnospiraceae is the most abundant. Below, we plot it’s abundance across the samples.

impOtu <- as.vector(otu_table(pslog)[,which.max(importance(rfFit$finalModel))])
maxImpDF <- data.frame(sample_data(pslog), abund = impOtu)
ggplot(maxImpDF) +   geom_histogram(aes(x = abund)) +
  facet_grid(age2 ~ .) +
  labs(x = "Abundance of discriminative bacteria", y = "Number of samples")

net <- make_network(ps, max.dist=0.35, keep.isolates = FALSE)
sampledata <- data.frame(sample_data(ps))
V(net)$id <- sampledata[names(V(net)), "host_subject_id"]
V(net)$litter <- sampledata[names(V(net)), "family_relationship"]

ggplot(net, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(color = "darkgray") +
  geom_nodes(aes(color = id, shape = litter),  size = 3 ) +
  theme(axis.text = element_blank(), axis.title = element_blank(),
        legend.key.height = unit(0.5,"line")) +
  guides(col = guide_legend(override.aes = list(size = .5)))

In this graph, we can see that the samples are grouped based on mouse from which the sample was collected and the litter of the mouse.

Graph based two-sample tests:

In this test, we make the minimum spanning tree (MST) based on the distance between samples and then counting the number of edges on the tree that were between samples in different groups.

gt <- graph_perm_test(ps, "family_relationship", grouping = "host_subject_id",
                      distance = "jaccard", type = "mst")
gt$pval
## [1] 0.01

Here since the p-value is very small, we can safely say that the two samples come from different distributions.

plotNet1=plot_test_network(gt) + theme(legend.text = element_text(size = 8),
        legend.title = element_text(size = 9))
plotPerm1=plot_permutations(gt)
grid.arrange(ncol = 2,  plotNet1, plotPerm1)

We see than the samples group by the litter more than we would expect by chance.

gt <- graph_perm_test(ps, "family_relationship", grouping = "host_subject_id",
                      distance = "jaccard", type = "knn", knn = 1)

plotNet2=plot_test_network(gt) + theme(legend.text = element_text(size = 8),
        legend.title = element_text(size = 9))
plotPerm2=plot_permutations(gt)
grid.arrange(ncol = 2,  plotNet2, plotPerm2)

The k-nearest neighbour graph also reinforces that the samples are grouping by litter.

Linear Modelling:

ps_alpha_div <- estimate_richness(ps, split = TRUE, measure = "Shannon")
ps_alpha_div$SampleID <- rownames(ps_alpha_div) %>%
  as.factor()
ps_samp <- sample_data(ps) %>%
  unclass() %>%
  data.frame() %>%
  left_join(ps_alpha_div, by = "SampleID") %>%
  melt(measure.vars = "Shannon",
       variable.name = "diversity_measure",
       value.name = "alpha_diversity")

# reorder's facet from lowest to highest diversity
diversity_means <- ps_samp %>%
  group_by(host_subject_id) %>%
  summarise(mean_div = mean(alpha_diversity)) %>%
  arrange(mean_div)
ps_samp$host_subject_id <- factor(ps_samp$host_subject_id)
#                                  diversity_means$host_subject_id)
alpha_div_model <- lme(fixed = alpha_diversity ~ age_binned, data = ps_samp,
                       random = ~ 1 | host_subject_id)
new_data <- expand.grid(host_subject_id = levels(ps_samp$host_subject_id),
                        age_binned = levels(ps_samp$age_binned))
new_data$pred <- predict(alpha_div_model, newdata = new_data)
X <- model.matrix(eval(eval(alpha_div_model$call$fixed)[-2]),
                  new_data[-ncol(new_data)])
pred_var_fixed <- diag(X %*% alpha_div_model$varFix %*% t(X))
new_data$pred_var <- pred_var_fixed + alpha_div_model$sigma ^ 2
# fitted values, with error bars
ggplot(ps_samp %>% left_join(new_data)) +
  geom_errorbar(aes(x = age_binned, ymin = pred - 2 * sqrt(pred_var),
                    ymax = pred + 2 * sqrt(pred_var)),
                col = "#858585", size = .1) +
  geom_point(aes(x = age_binned, y = alpha_diversity,
                 col = family_relationship), size = 0.8) +
  facet_wrap(~host_subject_id) +
  scale_y_continuous(limits = c(2.4, 4.6), breaks = seq(0, 5, .5)) +
  scale_color_brewer(palette = "Set2") +
  labs(x = "Binned Age", y = "Shannon Diversity", color = "Litter") +
  guides(col = guide_legend(override.aes = list(size = 4))) +
  theme(axis.text.x = element_text(angle = -90, size = 6),
        axis.text.y = element_text(size = 6))

Here we see how a single measure of the community structure is associated with the sample characteristics. We can see how the microbial community diversity reflects the environment.

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.2.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] caret_6.0-94                lattice_0.22-6             
##  [3] ggrepel_0.9.5               igraph_2.0.3               
##  [5] structSSI_1.2.0             randomForest_4.7-1.1       
##  [7] ggnetwork_0.5.13            PMA_1.2-3                  
##  [9] DESeq2_1.42.1               SummarizedExperiment_1.32.0
## [11] Biobase_2.62.0              MatrixGenerics_1.14.0      
## [13] matrixStats_1.3.0           GenomicRanges_1.54.1       
## [15] nlme_3.1-164                ade4_1.7-22                
## [17] reshape2_1.4.4              dplyr_1.1.4                
## [19] impute_1.76.0               genefilter_1.84.0          
## [21] phyloseqGraphTest_0.1.1     gridExtra_2.3              
## [23] ape_5.8                     DECIPHER_2.30.0            
## [25] RSQLite_2.3.6               Biostrings_2.70.3          
## [27] GenomeInfoDb_1.38.8         XVector_0.42.0             
## [29] IRanges_2.36.0              S4Vectors_0.40.2           
## [31] BiocGenerics_0.48.1         dada2_1.30.0               
## [33] Rcpp_1.0.12                 microbiome_1.24.0          
## [35] ggplot2_3.5.0               phyloseq_1.46.0            
## [37] knitr_1.46                 
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.3.1            bitops_1.0-7             tibble_3.2.1            
##   [4] hardhat_1.3.1            pROC_1.18.5              XML_3.99-0.16.1         
##   [7] rpart_4.1.23             lifecycle_1.0.4          globals_0.16.3          
##  [10] MASS_7.3-60.0.1          magrittr_2.0.3           sass_0.4.9              
##  [13] rmarkdown_2.26           jquerylib_0.1.4          DBI_1.2.2               
##  [16] RColorBrewer_1.1-3       lubridate_1.9.3          abind_1.4-5             
##  [19] ShortRead_1.60.0         zlibbioc_1.48.2          Rtsne_0.17              
##  [22] quadprog_1.5-8           purrr_1.0.2              RCurl_1.98-1.14         
##  [25] nnet_7.3-19              ipred_0.9-14             lava_1.8.0              
##  [28] GenomeInfoDbData_1.2.11  listenv_0.9.1            vegan_2.6-4             
##  [31] annotate_1.80.0          parallelly_1.37.1        permute_0.9-7           
##  [34] codetools_0.2-20         DelayedArray_0.28.0      tidyselect_1.2.1        
##  [37] farver_2.1.1             GenomicAlignments_1.38.2 jsonlite_1.8.8          
##  [40] multtest_2.58.0          e1071_1.7-14             survival_3.5-8          
##  [43] iterators_1.0.14         foreach_1.5.2            tools_4.3.1             
##  [46] glue_1.7.0               prodlim_2023.08.28       SparseArray_1.2.4       
##  [49] xfun_0.43                mgcv_1.9-1               withr_3.0.0             
##  [52] fastmap_1.1.1            latticeExtra_0.6-30      rhdf5filters_1.14.1     
##  [55] fansi_1.0.6              digest_0.6.35            timechange_0.3.0        
##  [58] R6_2.5.1                 colorspace_2.1-0         jpeg_0.1-10             
##  [61] utf8_1.2.4               tidyr_1.3.1              generics_0.1.3          
##  [64] pls_2.8-3                data.table_1.15.4        recipes_1.0.10          
##  [67] class_7.3-22             httr_1.4.7               S4Arrays_1.2.1          
##  [70] ModelMetrics_1.2.2.2     pkgconfig_2.0.3          gtable_0.3.4            
##  [73] timeDate_4032.109        blob_1.2.4               hwriter_1.3.2.1         
##  [76] htmltools_0.5.8.1        biomformat_1.30.0        scales_1.3.0            
##  [79] png_0.1-8                gower_1.0.1              rstudioapi_0.16.0       
##  [82] proxy_0.4-27             cachem_1.0.8             rhdf5_2.46.1            
##  [85] stringr_1.5.1            AnnotationDbi_1.64.1     pillar_1.9.0            
##  [88] grid_4.3.1               vctrs_0.6.5              xtable_1.8-4            
##  [91] cluster_2.1.6            evaluate_0.23            cli_3.6.2               
##  [94] locfit_1.5-9.9           compiler_4.3.1           Rsamtools_2.18.0        
##  [97] rlang_1.1.3              crayon_1.5.2             future.apply_1.11.2     
## [100] labeling_0.4.3           interp_1.1-6             plyr_1.8.9              
## [103] stringi_1.8.3            deldir_2.0-4             BiocParallel_1.36.0     
## [106] munsell_0.5.1            Matrix_1.6-5             bit64_4.0.5             
## [109] future_1.33.2            Rhdf5lib_1.24.2          KEGGREST_1.42.0         
## [112] highr_0.10               memoise_2.0.1            RcppParallel_5.1.7      
## [115] bslib_0.7.0              fastmatch_1.1-4          bit_4.0.5